Unsupervised Phenotype Scoring

Abstract

Unsupervised phenotyping clusters patients into homogeneous groups (“phenotypes”) based on observed clinical features. An open challenge is to define those patient phenotypes in a principled way that is both simple and clinically meaningful. In this work, we tackle this challenge by proposing a nonnegative matrix factorization approach that extracts phenotypes as sets of integer-weighted features, where the weights are constrained to take values from a small integer set. We employ a block alternating update scheme to solve the problem and formulate the extraction of the phenotype weights as an integer programming problem. Compared to a commonly used scale-and-round heuristic, our approach achieves consistently higher fit on two real electronic health record (EHR) datasets. Our proposed interpretation, as well as representative scoring-based phenotypes, were validated by a medical expert.

1 Problem Motivation

Phenotyping refers to the process of identifying sets of measurable disease markers that enable cohort identification, i.e., separating case patients from control patients for several diseases or disease sub-types of interest [32, 27]. Unsupervised phenotyping is the process of discovering the underlying patient groups, or phenotypes, directly from their EHR data. Matrix and tensor factorization has been a very successful machinery for unsupervised EHR-based phenotyping (e.g., [11, 12, 31, 25, 15, 26, 10]). Consider, for example, the problem setup of Nonnegative Matrix Factorization (NMF) [21] when minimizing the squared Frobenius norm of the error:

$$\min_{U \ge 0,\; V \ge 0} \; \lVert X - U V^{\top} \rVert_F^2 \qquad (1)$$

In the context of unsupervised phenotyping, $X \in \mathbb{R}^{M \times N}$ is a nonnegative input matrix whose cell $X(i, j)$ reflects the activity recorded (e.g., event counts) for the $i$-th (out of $M$) patient with respect to the $j$-th (out of $N$) medical feature. Given an input number $R$ of desired phenotypes, the matrix $U \in \mathbb{R}^{M \times R}$ is a membership matrix of the patients with respect to the $R$ phenotypes. Finally, the matrix $V \in \mathbb{R}^{N \times R}$ provides the phenotypes' definition: the non-zero elements of the $r$-th column $V(:, r)$ reveal the medical features relevant to the $r$-th phenotype. However, clinically interpreting those phenotypes is a challenge, as the values in $V$ can be large and hard to interpret. This paper tackles this challenge of improving the interpretability of factorization-based unsupervised phenotyping approaches.

There are several domains where applications of Problem (1) focus largely on achieving a low approximation error; for example, in recommender systems, good predictive power is usually the top priority [18]. In contrast, in the context of unsupervised phenotyping, model interpretability is equally (or even more) crucial. More specifically, it is imperative to provide the medical expert with a straightforward protocol for translating the phenotypes' definition factor $V$. We argue that this task is quite challenging if $V$ contains (nonnegative) real values, to the point that those values are often hidden altogether and only the resulting feature ranking is preserved (e.g., as happens in [12, 31]).

A primary reason for the above is that medical experts are used to dealing with simple and concise scoring-based descriptions of clinical status (e.g., risk scores) [30]. The main intuition behind the development of such scores is simple: the range of possible values expressing the contribution of each clinical aspect is restricted to a small set of consecutive integers (e.g., {0, 1, 2, 3}). A value of zero indicates no contribution of the feature, and higher scores indicate distinct levels of severity or importance. MDCalc lists a vast number of such scores used in medicine.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
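To make Problem (1) and the scale-and-round baseline mentioned in the abstract more concrete, the following is a minimal sketch, not the authors' method. It assumes standard multiplicative updates for the real-valued NMF (swapped in here for simplicity; the paper itself uses a block alternating scheme with an integer programming step), and it assumes one plausible form of scale-and-round: each column of V is rescaled so that its largest entry equals the top score, then rounded. The function names `nmf_multiplicative`, `scale_and_round`, and `relative_error`, and the parameter `max_score`, are hypothetical.

```python
import numpy as np


def nmf_multiplicative(X, R, n_iter=300, eps=1e-9, seed=0):
    """Approximate X (M x N, nonnegative) as U @ V.T with U, V >= 0.

    Uses standard multiplicative updates for the squared Frobenius
    objective; this is an illustrative solver, not the paper's scheme.
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, R)) + eps  # patient-to-phenotype memberships
    V = rng.random((N, R)) + eps  # phenotype definitions over features
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V


def scale_and_round(U, V, max_score=3):
    """Assumed scale-and-round heuristic: map each column of V to
    integers in {0, ..., max_score}.

    Each column of V is scaled so its maximum equals max_score, then
    rounded. U is divided by the same factor so that the product is
    unchanged up to the rounding error.
    """
    col_max = V.max(axis=0)
    col_max[col_max == 0] = 1.0  # guard against all-zero columns
    scale = max_score / col_max
    V_int = np.rint(V * scale).astype(int)
    U_scaled = U / scale
    return U_scaled, V_int


def relative_error(X, U, V):
    """Relative Frobenius error of the approximation X ~ U @ V.T."""
    return np.linalg.norm(X - U @ V.T) / np.linalg.norm(X)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic nonnegative data with exact rank 3 (illustration only).
    X = rng.random((30, 3)) @ rng.random((12, 3)).T
    U, V = nmf_multiplicative(X, R=3)
    U_s, V_int = scale_and_round(U, V)
    print("real-valued fit error:", relative_error(X, U, V))
    print("scale-and-round error:", relative_error(X, U_s, V_int))
    print("integer phenotype weights:\n", V_int)
```

The integer matrix `V_int` plays the role of a scoring-style phenotype definition: a zero means the feature does not contribute, and larger values indicate larger contributions. Because rounding happens after the fit rather than inside it, the fit can degrade, which is the shortcoming that motivates treating the integer constraint as part of the optimization.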


Similar resources

Improved Automatic Clustering Using a Multi-Objective Evolutionary Algorithm With New Validity measure and application to Credit Scoring

In data mining, clustering is one of the important issues for separation and classification with groups like unsupervised data. In this paper, an attempt has been made to improve and optimize the application of clustering heuristic methods such as Genetic, PSO algorithm, Artificial bee colony algorithm, Harmony Search algorithm and Differential Evolution on the unlabeled data of an Iranian bank...


Short Answer Grading Using String Similarity And Corpus-Based Similarity

Most automatic scoring systems use pattern based that requires a lot of hard and tedious work. These systems work in a supervised manner where predefined patterns and scoring rules are generated. This paper presents a different unsupervised approach which deals with students’ answers holistically using text to text similarity. Different String-based and Corpus-based similarity measures were tes...


Applying Unsupervised Learning To Support Vector Space Model Based Speaking Assessment

Vector Space Models (VSM) have been widely used in the language assessment field to provide measurements of students’ vocabulary choices and content relevancy. However, training reference vectors (RV) in a VSM requires a time-consuming and costly human scoring process. To address this limitation, we applied unsupervised learning methods to reduce or even eliminate the human scoring step require...


Optimal Scoring for Unsupervised Learning

We are often interested in casting classification and clustering problems as a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing the Fisher linear discriminant analysis by regression, in the application of unsupervised l...


A New Scheme for Scoring Phrases in Unsupervised Keyphrase Extraction

Many unsupervised methods for keyphrase extraction typically compute a score for each word in a document based on various measures such as tf-idf or the PageRank score computed from the word graph built from the text document. The final score of a candidate phrase is then calculated by summing up the scores of its constituent words. A potential problem with the sum up scoring scheme is that the...




Publication date: 2017